HW1 Stat4ACSAI

Group : Benjamin Barda, Yusupha Juwara, Francesco Danese

[NOTE] On the interactive plot: the interactive plot from Exercise 2 works only in Jupyter Notebook (not JupyterLab). For the plot to display correctly it needs to be executed by hand (SHIFT+ENTER). We suggest running with the 'Run All' option and then re-running ONLY that cell by hand ... We have a non-interactive plot in the notebook as well, but we could not resist making the plot interactive.

Libraries and python boring stuff

Exercise 1

Suppose that a student’s score X in ML is a number between 0 and 1, and that her score Z in Statistics is also a number between 0 and 1. Suppose further that in the population of all ACSAI students in the world (!), these scores are distributed according to the following joint pdf:

$$ f_{X, Z}(x, z) = \begin{cases} 8xz & \text{for } 0 < z < x < 1 \\ 0 & \text{otherwise} \end{cases} $$

1) Check that $ f_{X,Z}(x, z) $ is a legit joint pdf and plot it in Python. What proportion of students obtain a score greater than 0.5 in Statistics, and what is the probability that a randomly selected student will have a Stat-score exactly equal to 0.5?

a). For $f_{X, Z}(x, z)$ to be a legitimate joint pdf, two requirements must hold: it must be non-negative everywhere, and it must integrate to 1. Non-negativity is immediate since $8xz \ge 0$ on the support; for the integral:

$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X, Z}(x, z) \, dz \, dx = \int_{0}^{1} \int_{0}^{x} 8xz \, dz \, dx = \int_{0}^{1} 4x \left[ z^2 \right]_{0}^{x} dx = \int_{0}^{1} 4x^3 \, dx = \left. x^4 \right|_{0}^{1} = 1 $$
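As a quick numerical sanity check of the computation above (a sketch, not one of the original notebook cells), `scipy` can integrate the joint pdf over the triangular support:

```python
# Numerically verify that f(x, z) = 8xz integrates to 1 over 0 < z < x < 1.
from scipy.integrate import dblquad

# dblquad integrates the inner variable first: func(z, x), with z in (0, x)
# for each x in (0, 1).
total, _ = dblquad(lambda z, x: 8 * x * z, 0, 1, lambda x: 0, lambda x: x)
print(round(total, 6))  # 1.0
```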

b). The plot of the PDF is as follows

c). What proportion of students obtain a score greater than 0.5 in Statistics?

To answer this:

$$f_{Z} (z) = \int_{-\infty}^{\infty} f_{X, Z} (x, z) dx = \int_{z}^{1} 8xz dx = \left. 4x^2 z \right|_{z}^{1} = 4z - 4z^3 = 4z(1-z^2) $$ $$ P(Z > 0.5) = \int_{0.5}^{1} (4z-4z^3) dz = \left [2z^2 - z^4 \right]_{0.5}^{1} = (2-1) - (2*0.5^2 - 0.5^4) = 0.5625 \approx 0.56 $$

Note that we could have found the same with the joint pdf as follows:

$$ \int_{0.5}^1 \int_{z}^1 8xz dx dz = \left. \int_{0.5}^1 4x^2z \right|_{z}^1 dz = \int_{0.5}^1 4z-4z^3 dz= \dots \approx 0.56 $$

Therefore, the proportion of students with marks greater than 0.5 in Statistics is approximately 0.56

d). What is the probability that a randomly selected student will have a Stat-score exactly equal to 0.5?

The answer is simply 0, since $Z$ is a continuous random variable. See the integral below:

$$ P(Z=0.5) = \int_{0.5}^{0.5} (4z-4z^3) dz = [ 2z^2 - z^4]_{0.5}^{0.5} = 0 $$

2) Let W = log(Z) be the log-Stat score. Find and then plot its density in Python. What predicted value of W has the smallest mean squared error (MSE)? Find it analytically. Find also the median log-Stat score.

a). Find the pdf of $ W= log(Z) $

$$ F_{W}(w) = P(W \leq w) = P(\log(Z) \leq w) = P(Z \leq e^w) = F_{Z}(e^w) $$$$ f_{W}(w) = \frac{d}{dw} F_{Z}(e^w) = f_{Z}(e^w) \, \frac{d}{dw} e^w = (4e^w - 4e^{3w}) \, e^w = 4e^{2w} - 4e^{4w} \quad \text{for } -\infty < w < 0 $$
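Before plotting, we can numerically check that this density integrates to 1 over its support (a quick sketch, not one of the original cells; `f_W` is a hypothetical name):

```python
# Check that f_W(w) = 4e^{2w} - 4e^{4w} is a valid density on (-inf, 0).
import numpy as np
from scipy.integrate import quad

def f_W(w):
    # density of W = log(Z), supported on (-inf, 0)
    return 4 * np.exp(2 * w) - 4 * np.exp(4 * w)

total, _ = quad(f_W, -np.inf, 0)
print(round(total, 6))  # 1.0
```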

b). Plot W=log(Z)'s pdf in python

c). What predicted value of W has the smallest mean squared error (MSE)? Find it analytically

We would like to predict W without observing anything, i.e. with a single constant $\theta$. What would be our best guess in that case? The MSE of predicting $\theta$ is given by

$$ MSE(\theta) = E[(W - \theta)^2] = E[W^2] - 2\theta E[W] + \theta^2$$

This is a quadratic in $\theta$, minimized where $\displaystyle\frac{\partial MSE}{\partial \theta} = 0$.

Thus, $2\theta - 2E[W] = 0 \implies \theta = E[W]$

We now find $E[W]$:

$$ \int_{-\infty}^0 w (4e^{2w} - 4e^{4w}) dw = \underbrace{ w(2e^{2w}-e^{4w}) |_{-\infty}^0 }_{ 0} - \int_{-\infty}^0 (2e^{2w}-e^{4w}) dw = \int_{-\infty}^0 e^{4w}-2e^{2w} dw = \\ \left [ \frac{e^{4w}}{4} - e^{2w} \right]_{-\infty}^0 = \frac{1}{4} - 1 = - \frac{3}{4} $$

So the best MSE predictor of W is $E[W] = -\frac{3}{4}$.
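The integration by parts can be confirmed numerically (a quick sketch with scipy, not one of the original cells):

```python
# Numerically compute E[W] = integral of w * f_W(w) over (-inf, 0).
import numpy as np
from scipy.integrate import quad

mean_W, _ = quad(lambda w: w * (4 * np.exp(2 * w) - 4 * np.exp(4 * w)),
                 -np.inf, 0)
print(round(mean_W, 6))  # -0.75
```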

3) Assuming a student got 0.8 in ML, find analytically the best MSE predictor for her Stat-score.

From (c) above, if we have observed X = x, we can repeat the same argument. The only difference is that everything is conditioned on X = 0.8. More specifically, the MSE is given by

$$ MSE = E[(Z- \theta)^2|X=0.8] = E[Z^2|X=0.8] - 2\theta E[Z|X=0.8] + \theta^2$$

Again, we obtain a quadratic function of $\theta$, and by differentiation we obtain the MMSE estimate of Z given X=0.8 as

$$E[Z|X=0.8]$$

Now let's find this conditional expectation.

Note that the conditional density $f_{Z|X}(z|x)$ is supported on $0 < z < x$, so given $X = 0.8$ the integral runs up to $0.8$, not $1$:

$$ E[Z|X=0.8] = \int_0^{0.8} z \, f_{Z|X}(z|0.8) \, dz = \int_0^{0.8} z \, \frac{f_{X, Z}(0.8, z)}{f_X (0.8)} \, dz = \int_0^{0.8} z \, \frac{8 \cdot 0.8 \cdot z}{4 \cdot 0.8^3} \, dz = \int_0^{0.8} \frac{25z^2}{8} \, dz \\ = \left. \frac{25z^3}{24} \right|_0^{0.8} = \frac{8}{15} \approx 0.533 $$

Where $f_X(x) = \int_0^x 8xz dz = 4xz^2|_0^x = 4x^3$
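As a numerical check (a sketch, not one of the original cells), we can integrate the conditional density $f_{Z|X}(z|x) = 2z/x^2$ on $(0, x)$ directly; it gives $E[Z|X=0.8] = 8/15 \approx 0.533$:

```python
# E[Z | X = x] via direct integration of z * f_{Z|X}(z|x) = z * 2z/x^2 on (0, x).
from scipy.integrate import quad

x = 0.8
cond_mean, _ = quad(lambda z: z * 2 * z / x**2, 0, x)
print(round(cond_mean, 4))  # 0.5333, i.e. 8/15
```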

Exercise 2

We have the following probability density function:
$$f(x|\alpha) = \frac{1}{2\pi}(1 + \alpha\cos(x)) \quad\quad for \space x\in [0,2\pi] \quad and \quad \alpha\in [-1/3,1/3]$$

In order to check that this is a valid PDF we must ensure 2 properties: that it integrates to 1 over its support, and that it is non-negative everywhere.


For the integral:

$$\int\frac{1}{2\pi}(1 + \alpha\cos(x))\ dx \space=\space \frac{1}{2\pi}\int (1 + \alpha\cos(x))\ dx \space=\space \frac{1}{2\pi}\space [\space\int dx\ + \alpha \int \cos(x)\ dx\space] \space=\space \frac{1}{2\pi}\space [x + \alpha\sin(x)]$$

Let's evaluate at $|_{0}^{2\pi}$

$$ \frac{1}{2\pi}\space [2\pi - 0] \space=\space \frac{2\pi}{2\pi} \space=\space 1$$
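The normalization holds for every admissible $\alpha$, which is easy to confirm numerically (a sketch, not one of the original cells):

```python
# Check that f(x | alpha) integrates to 1 over [0, 2*pi] for several
# admissible values of alpha in [-1/3, 1/3].
import numpy as np
from scipy.integrate import quad

def f(x, alpha):
    return (1 + alpha * np.cos(x)) / (2 * np.pi)

for alpha in (-1 / 3, 0.0, 1 / 3):
    total, _ = quad(f, 0, 2 * np.pi, args=(alpha,))
    assert abs(total - 1.0) < 1e-10  # normalized for every alpha tested
```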


For non-negativity, I take the first derivative of the function and find its minimum(s):

For $\alpha > 0$:
$f^{'}(x) = - \frac{\alpha}{2\pi}\sin(x) > 0\quad \Rightarrow\quad \sin(x) < 0\quad \Rightarrow\quad \pi < x < 2\pi$
$f^{'}(x) = - \frac{\alpha}{2\pi}\sin(x) < 0 \quad \Rightarrow\quad \sin(x) > 0 \quad \Rightarrow\quad 0 < x < \pi$

The derivative is positive for $x > \pi$ and negative for $x < \pi$, therefore $x = \pi$ is a minimum.
Now let's check whether the function is negative at its minimum. We proceed by contradiction, supposing $f$ is negative there:

$f(\pi) < 0 \quad =>\quad \frac{1-\alpha}{2\pi} < 0 \quad =>\quad 1 - \alpha < 0 \quad =>\quad \alpha > 1$

And we see that the function is negative at its minimum only when $\alpha > 1$; but here $\alpha\in (0, 1/3]$, so $f(\pi) \ge 0$ for every admissible $\alpha$.
Since $f(x)$ is non-negative at its minimum, it is non-negative at every point of the support $S$.

By the same reasoning, one can easily show that the function is also non-negative everywhere when $\alpha < 0$ (the minimum then sits at the endpoints, where $f = \frac{1+\alpha}{2\pi} \ge 0$ since $\alpha \ge -1/3$), and trivially when $\alpha = 0$.
It's easy to check by playing with the interactive graph below; you can change alpha as you prefer:

IMPORTANT : just run manually ONLY the cell below to see the interactive graph


To find the method of moments estimator let's compute the mean of the distribution:

$$\mu_{X} = \frac{1}{2\pi}\int_{0}^{2\pi}x(1+\alpha \cos(x))\, dx = \frac{1}{2\pi}\left(\int x\, dx + \alpha \int x \cos(x)\, dx \right) = \frac{1}{2\pi}\left(\frac{x^2}{2}+ \alpha x \sin(x) + \alpha \cos(x)\right) $$


Let's evaluate at $|_{0}^{2\pi}$ $$ \frac{4\pi^2 + 2\alpha}{4\pi} - \frac{\alpha}{2\pi} = \frac{2\pi^2 +\alpha}{2\pi} - \frac{\alpha}{2\pi} = \frac{2\pi^2}{2\pi} = \pi $$

The mean is $\pi$ for all values of $\alpha$ so we have to compute the second moment to obtain an estimator:

$$\mu_2 = \frac{1}{2\pi}\int_{0}^{2\pi}x^2(1+\alpha \cos(x))\, dx = \frac{1}{2\pi} \left[\int x^2 \, dx + \alpha\int x^2 \cos(x) \, dx\right] =$$$$ \frac{1}{2\pi}\left[\frac{x^3}{3} + \alpha x^2 \sin(x) + 2\alpha x \cos(x) - 2\alpha\sin(x)\right]$$


Let's evaluate at $|_{0}^{2\pi}$

$$\frac{2\pi(4\pi^2 + 6\alpha)}{6\pi} - 0 = \frac{4\pi^2 +6\alpha}{3} = \frac{4}{3}\pi^2 + 2\alpha $$


Therefore $\mu_2 = \frac{4}{3}\pi^2 + 2\alpha\space$ that leads to $$\space\hat{\alpha} = \frac{1}{2}(\mu_2- \frac{4}{3}\pi^2)$$
where $\mu_2$ is replaced by $\overline{x^2}$, the sample mean of the squared observations.
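The estimator is a one-liner; a minimal sketch (the function name is hypothetical):

```python
# Method-of-moments estimator: alpha_hat = (mean(x^2) - 4*pi^2/3) / 2.
import numpy as np

def alpha_mom(sample):
    m2 = np.mean(np.asarray(sample) ** 2)  # sample second moment
    return 0.5 * (m2 - 4 * np.pi ** 2 / 3)
```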

Based on the small dataset, we found $\hat{\alpha} \approx 1.356$, which notably falls outside the admissible range $[-1/3, 1/3]$.


For the log-likelihood function we apply the natural logarithm to the likelihood function of our pdf:

$$\boldsymbol\ell (\alpha) = \ln\prod_{i=1}^{n} f(x_i|\alpha) = \sum_{i=1}^{n}\ln\left(\frac{1}{2\pi}(1 + \alpha\cos(x_i))\right) = \sum_{i=1}^{n}\ln\left(\frac{1}{2\pi}\right) + \sum_{i=1}^{n}\ln(1 + \alpha\cos(x_i)) = $$

$$\ln\left(\frac{1}{(2\pi)^n}\right) + \sum_{i=1}^{n}\ln(1 + \alpha\cos(x_i))$$

We now proceed to implement it in python and plot it passing the vector $x_{20}$
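A minimal sketch of that implementation (the $x_{20}$ data vector itself is not reproduced here, and the function name is hypothetical):

```python
# Log-likelihood l(alpha) = -n*log(2*pi) + sum_i log(1 + alpha*cos(x_i)).
import numpy as np

def log_likelihood(alpha, x):
    x = np.asarray(x)
    # log1p(t) = log(1 + t), numerically stable for small alpha*cos(x)
    return -len(x) * np.log(2 * np.pi) + np.sum(np.log1p(alpha * np.cos(x)))
```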


In order to find the Maximum Likelihood Estimator we need to determine the point at which $\ell (\alpha)$ is maximized. To do this, we can differentiate the likelihood function and set it equal to 0:

$$\frac{\partial \ell (\alpha)}{\partial \alpha} = \sum_{i=1}^{n}\frac{\cos(x_i)}{1 + \alpha\cos(x_i)} = 0$$

Solving for $\alpha$ analytically is pretty complicated, so let's have Python find the maximum for us using L-BFGS.

$\quad$ First without gradient:



Now with gradient, using our derivative previously found:
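The two optimizer calls might look like the sketch below. Since the original $x_{20}$ sample is not reproduced here, we rejection-sample 20 hypothetical points just to make the snippet runnable; with or without the analytic gradient, the same maximizer should be found, and `nfev` without the gradient should be roughly double:

```python
# L-BFGS-B maximization of the log-likelihood (we minimize its negative),
# once with finite-difference gradients and once with the analytic gradient.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def draw(n, alpha):
    # Hypothetical stand-in for the notebook's x_20 vector:
    # rejection-sample n points from f(x | alpha) on [0, 2*pi].
    out, fmax = [], (1 + abs(alpha)) / (2 * np.pi)
    while len(out) < n:
        x = rng.uniform(0, 2 * np.pi)
        if rng.uniform(0, fmax) < (1 + alpha * np.cos(x)) / (2 * np.pi):
            out.append(x)
    return np.array(out)

x = draw(20, 0.2)

def neg_ll(alpha, x):
    # negative log-likelihood: minimizing this maximizes l(alpha)
    return len(x) * np.log(2 * np.pi) - np.sum(np.log1p(alpha * np.cos(x)))

def neg_ll_grad(alpha, x):
    # derivative found above, sign-flipped for the negative log-likelihood
    c = np.cos(x)
    return np.array([-np.sum(c / (1 + alpha * c))])

bounds = [(-1 / 3, 1 / 3)]
res_no_grad = minimize(neg_ll, x0=[0.0], args=(x,),
                       method="L-BFGS-B", bounds=bounds)
res_grad = minimize(neg_ll, x0=[0.0], args=(x,),
                    method="L-BFGS-B", jac=neg_ll_grad, bounds=bounds)
print(res_no_grad.x[0], res_no_grad.nfev)
print(res_grad.x[0], res_grad.nfev)
```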


As we can see, passing the gradient as well saves about half of the function evaluations. Math is good 🙃
In both cases we found $\hat\alpha_{MLE} \approx 0.1907$

Now let's plot the empirical cdf of the raw data provided along with the cdf of the model, using the MLE and MoM estimates as $\alpha$. To do so we first derive a generic cdf:

$$\int_{0}^{k}\frac{1}{2\pi}(1 + \alpha\cos(x))\ dx \space=\space \frac{1}{2\pi}\int_{0}^{k} (1 + \alpha\cos(x))\ dx \space=\space \frac{1}{2\pi}\space [\space\int_{0}^{k} dx\ + \alpha \int_{0}^{k} \cos(x)\ dx\space] \space=\space\frac{1}{2\pi}\space [k - 0 + \alpha\sin(k) - 0]\space=\space\frac{1}{2\pi}\space [k + \alpha\sin(k)]$$
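This cdf is a one-liner in Python (a sketch; the function name is hypothetical):

```python
# F(k) = (k + alpha * sin(k)) / (2 * pi), valid for k in [0, 2*pi].
import numpy as np

def model_cdf(k, alpha):
    return (k + alpha * np.sin(k)) / (2 * np.pi)
```

It evaluates to 0 at $k = 0$ and (up to floating point) to 1 at $k = 2\pi$, for any $\alpha$.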

And then plot this cdf using $\alpha = 0.1907$ (MLE) and $\alpha = 1.356$ (MoM)

Based on this graph, we could say the MLE fits pretty well; the MoM, on the other hand, clearly shows some lack of fit. This likely depends on the fact that the sample size of the data provided was quite small (20 elements), which hurt the precision of the MoM estimator.

Exercise number 3

Helper Functions

[3.1] Data cleaning and selection

Import the data, and look at the automakers listed in the variable make. Select a large one and obtain a derived dataset by filtering out all the others.

I am more interested in the unique values than in the raw counts ... so let's drop duplicates

and check for Null Values

We have two choices at this point: TOYOTA or Mercedes-Benz.

For the sake of this exercise, let us look at the bigger one, Toyota.

[3.2 / 3] Selection and plot

2. Look at the other five numeric variables available, and pick the two that you feel are the most relevant to perform the task described. Briefly explain your choice

3. Check the presence of missing values, filter them out and then visualize and numerically summarize the distribution of the two variables selected

I suspect that number_gears would not be very informative compared to engine size (displacement, number_cyl) and consumption (city_mpg, hwy_mpg).

In fact, as we can see from its density, most cars have either 6 or 8 gears as expected, so I would discard this column as a candidate.

Also, I would avoid choosing two highly correlated features, so I would proceed by choosing one from the 'consumption' category and one from the engine-size group.

Following this reasoning, I would pick the attributes carrying the most information, that is city_mpg and displacement ... Let's plot their densities

I am happy with this choice, so we can proceed with the analysis

[3.4] MoG and Cross Validation

4. Cluster the data using MoG by selecting the number of components via a 70%-30% sample-splitting scheme that you implement from scratch (just the evaluation, for the fit use of course scikit-learn). Compare the results with the AIC selection (again, use the implementation in scikit-learn). Beside the (possibly different) optimal number of components, can you figure out a way to evaluate some form of “agreement” between the two clusterizations obtained? In general, do you think that one is any better than the other? Explain, possibly with relevant plots or stats.

The question is: how many components should our Gaussian mixture have? We use cross-validation to check.

Lower is better, and it is easy to see that 4 components is the best choice
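The 70%-30% evaluation can be sketched as below, assuming `X` is the `(n, 2)` array of the two selected features (names hypothetical; scikit-learn's `GaussianMixture.score` returns the mean log-likelihood per sample, so we negate it to get a lower-is-better criterion):

```python
# Hold-out selection of the number of mixture components:
# fit on 70% of the rows, score the held-out 30%.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def holdout_neg_ll(X, ks=range(1, 9), seed=42):
    X_tr, X_te = train_test_split(X, test_size=0.3, random_state=seed)
    return {k: -GaussianMixture(n_components=k, random_state=seed)
                .fit(X_tr).score(X_te)  # negative mean log-likelihood
            for k in ks}
```

The selected number of components is then `min(scores, key=scores.get)`.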

We can see that the model we chose does not really consider displacement relevant. This might be because of the very different scales of the two features

So maybe scaling might be an option

As we can see, we now obtain a much more reasonable clustering, since city_mpg no longer dominates the other feature

[3.5] K-MEANS

5. Repeat the analysis using k-mean by selecting k via the Elbow method. Compare with the previous results.

Let's try to do it on the scaled dataset
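A minimal sketch of the Elbow method on the scaled features (`X` and the function name are hypothetical; the "elbow" is read off a plot of these inertias against k):

```python
# Within-cluster sum of squares (inertia) for each candidate k on scaled data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_curve(X, ks=range(1, 9), seed=42):
    Xs = StandardScaler().fit_transform(X)
    return [KMeans(n_clusters=k, n_init=10, random_state=seed)
            .fit(Xs).inertia_ for k in ks]
```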

Considerations

In my opinion, looking at just two features out of five, for a single manufacturer, it is indeed impossible to infer the type of vehicle. As an example, think of a hybrid car ... Assuming the data is correct, a hybrid car would have a very high city_mpg (low consumption) but a comparatively low highway_mpg, because during a highway trip the electric engine would (almost) never be engaged.

One way to get around this problem would be to repeat the analysis unifying the consumption columns (highway_mpg, city_mpg). According to the report given in the exercise, we can compute the overall consumption as a weighted sum of the two (45% for the former and 55% for the latter).

We can think of the clusters as follows:

  1. High consumption, big engine: these could be SUVs
  2. High consumption, medium-sized engine: these could be sports cars
  3. Moderate consumption, small engine: we can think of family cars
  4. Low consumption, small engine: city cars / hybrids